gh-138270: Use PyUnicodeWriter in csv.writer #138271
base: main
Conversation
cc @vstinner
Please:
- don't add comments for self-explanatory code;
- follow PEP-7 for C code (see the formatting sketch after this list);
- revert unrelated changes;
- provide benchmarks to show whether this speeds things up or not.
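For reference on the PEP 7 point, new C code in CPython is expected to look roughly like this: 4-space indents, the return type on its own line above the function name, and braces opening on the same line as flow-control statements. This is a generic illustration, not code from this PR:

```c
#include <Python.h>

/* Illustrative PEP 7 formatting only; not code from this PR. */
static Py_ssize_t
count_char(PyObject *field, Py_UCS4 wanted)
{
    Py_ssize_t len = PyUnicode_GET_LENGTH(field);
    Py_ssize_t count = 0;

    for (Py_ssize_t i = 0; i < len; i++) {
        if (PyUnicode_READ_CHAR(field, i) == wanted) {
            count++;
        }
    }
    return count;
}
```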
c == dialect->escapechar ||
c == dialect->quotechar) {
    if (dialect->escapechar == NOT_SET) {
        PyErr_SetString(self->error_obj, "need to escape, but no escapechar set");
Don't change this.
bool first_field_was_empty_like = false;
bool first_field_was_none = false;
bool first_field_was_quoted_in_loop = false;
Why are those now needed?
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
And yes, it could be meaningful to use PyUnicodeWriter instead of manual buffer constructions, but we need to check if this really improves things or not before deciding whether we can do it.
Marking this as a draft. There's a performance regression.
The script:

```python
import csv
import io

import pyperf

runner = pyperf.Runner()

INT_ROW = list(range(10))
COMPLEX_STRING_ROW = ['a,b', 'c"d', 'e\nf'] * 3 + ['ghi']


def write_the_rows(rows):
    f = io.StringIO()
    writer = csv.writer(f)
    writer.writerows(rows)


for num_rows in (10, 1_000, 10_000):
    int_rows = [INT_ROW] * num_rows
    complex_rows = [COMPLEX_STRING_ROW] * num_rows
    runner.bench_func(
        f'writerows {num_rows} integer rows',
        write_the_rows,
        int_rows,
    )
    runner.bench_func(
        f'writerows {num_rows} complex string rows',
        write_the_rows,
        complex_rows,
    )
```

There are two obvious issues:
I need some time to address them, perhaps with a jump-based approach similar to gh-138214.
/* grow record buffer if necessary */
if (!join_check_rec_size(self, self->rec_len + terminator_len))
    return 0;
if (PyUnicodeWriter_WriteChar(writer, c) < 0) { |
It would be interesting to try calling WriteSubstring() at once rather than writing characters one by one.
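A sketch of that idea, with hypothetical names: `needs_escape()` stands in for the real per-character dialect tests, and a plain backslash stands in for the dialect's escapechar. The loop flushes each maximal run of ordinary characters with a single PyUnicodeWriter_WriteSubstring() call instead of one PyUnicodeWriter_WriteChar() call per character:

```c
#include <Python.h>

/* Hypothetical stand-in for the real per-character dialect tests
   (delimiter, escapechar, quotechar, lineterminator, etc.). */
static int
needs_escape(Py_UCS4 c)
{
    return c == ',' || c == '"' || c == '\n' || c == '\r';
}

/* Write runs of ordinary characters in bulk; escape the rest. */
static int
write_field(PyUnicodeWriter *writer, PyObject *field)
{
    Py_ssize_t len = PyUnicode_GET_LENGTH(field);
    Py_ssize_t start = 0;

    for (Py_ssize_t i = 0; i < len; i++) {
        Py_UCS4 c = PyUnicode_READ_CHAR(field, i);
        if (!needs_escape(c)) {
            continue;
        }
        /* Flush the pending run before the character that needs escaping. */
        if (i > start
            && PyUnicodeWriter_WriteSubstring(writer, field, start, i) < 0) {
            return -1;
        }
        if (PyUnicodeWriter_WriteChar(writer, '\\') < 0
            || PyUnicodeWriter_WriteChar(writer, c) < 0) {
            return -1;
        }
        start = i + 1;
    }
    /* Flush the final run, if any. */
    if (start < len
        && PyUnicodeWriter_WriteSubstring(writer, field, start, len) < 0) {
        return -1;
    }
    return 0;
}
```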
The purpose of this PR is not performance but to use the modern PyUnicodeWriter API (https://docs.python.org/dev/c-api/unicode.html#c.PyUnicodeWriter), similarly to gh-125196.
There's a risk that the code is slower, as it turned out in gh-133968. I'd prefer optimizing it after getting an ack that this is the correct direction.
Similarly to #138214 (comment), I'm not sure what the best benchmarking strategy is, besides a simple snippet. Perhaps we need https://github.com/nineteendo/jsonyx-performance-tests, but for CSV.
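For anyone skimming the thread, the basic shape of the API in question (per the docs linked above) is create, write, then Finish(), with Discard() on the error path. A minimal illustration, not the PR's actual join code:

```c
#include <Python.h>

/* Minimal PyUnicodeWriter lifecycle, for illustration only. */
static PyObject *
join_two_fields(PyObject *field1, PyObject *field2)
{
    /* The argument is a preallocation size hint; 0 is allowed. */
    PyUnicodeWriter *writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
    if (PyUnicodeWriter_WriteStr(writer, field1) < 0
        || PyUnicodeWriter_WriteChar(writer, ',') < 0
        || PyUnicodeWriter_WriteStr(writer, field2) < 0)
    {
        /* Discard() frees the writer without producing a string. */
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
    /* Finish() destroys the writer and returns the built str. */
    return PyUnicodeWriter_Finish(writer);
}
```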
I believe that csv.reader (ReaderObj) could also use PyUnicodeWriter. If my thinking is sound, there's interest, and this code is OK, I can handle it.
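A minimal sketch of that reader-side idea, with a hypothetical helper name: the real parse_add_char() in Modules/_csv.c also enforces field_size_limit, which is omitted here.

```c
#include <Python.h>

/* Hypothetical: append one parsed character to the current field via
   PyUnicodeWriter instead of a manually grown Py_UCS4 buffer. Unlike
   the real parse_add_char(), this skips the field_size_limit check. */
static int
field_add_char(PyUnicodeWriter *field_writer, Py_UCS4 c)
{
    return PyUnicodeWriter_WriteChar(field_writer, c);
}
```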
csv module should use PyUnicodeWriter #138270